Toward a self-organizing pre-symbolic neural model representing sensorimotor primitives
The acquisition of symbolic and linguistic representations of sensorimotor behavior is
a cognitive process performed by an agent when it is executing and/or observing its
own and others' actions. According to Piaget's theory of cognitive development, these
representations develop during the sensorimotor stage and the pre-operational stage. 
We propose a model that relates the conceptualization of the higher-level information 
from visual stimuli to the development of ventral/dorsal visual streams. This model 
employs a neural network architecture incorporating a predictive sensory module based on
an RNNPB (Recurrent Neural Network with Parametric Biases) and a horizontal product 
model. We exemplify this model through a robot passively observing an object to learn 
its features and movements. During the learning process of observing sensorimotor 
primitives, i.e., observing a set of trajectories of arm movements and its oriented object 
features, the pre-symbolic representation is self-organized in the parametric units. These 
representational units act as bifurcation parameters, guiding the robot to recognize and 
predict various learned sensorimotor primitives. The pre-symbolic representation also 
accounts for the learning of sensorimotor primitives in a latent learning context. 
1. INTRODUCTION

Although infants are not supposed to acquire the symbolic representational system at the
sensorimotor stage, based on Piaget's definition of infant development, the preparation of
language development, such as a pre-symbolic representation for conceptualization, has been
set at the time when the infant starts babbling (Mandler, 1999). Experiments have shown that
infants have established the concept of animate and inanimate objects, even if they have not
yet seen the objects before (Gelman and Spelke, 1981). Similar phenomena also include the
conceptualization of object affordances such as the conceptualization of containment
(Bonniec, 1985). This conceptualization mechanism is developed at the sensorimotor stage to
represent sensorimotor primitives and other object-affordance related properties.

During an infant's development at the sensorimotor stage, one way to learn affordances is to
interact with objects using tactile perception, observe the objects through visual
perception, and thus learn the causal relation between visual features, affordances and
movements, as well as conceptualize them. This learning starts with the basic ability to move
an arm toward visually fixated objects in new-born infants (Von Hofsten, 1982), continues
through object-directed reaching at the age of 4 months (Streri et al., 1993; Corbetta and
Snapp-Childs, 2009), and can also be found during the object exploration of older infants
(cf. Ruff, 1984; Mandler, 1992). From these interactions leading to visual and tactile
percepts, infants gain experience through the instantiated "bottom-up" knowledge about object
affordances and sensorimotor primitives. Building on this, infants at the age of around 8-12
months gradually expand the concept of object features, affordances and the possible causal
movements in the sensorimotor context (Gibson, 1988; Newman et al., 2001; Rocha et al.,
2006). For instance, they realize that it is possible to pull a string that is tied to a toy
car to fetch it instead of crawling toward it. An associative rule has also been built that
connects conceptualized visual feature inputs, object affordance and the corresponding
frequent auditory inputs of words, across various contexts (Romberg and Saffran, 2010). At
this stage, categories of object features are particularly learned in different contexts due
to their affordance-invariance (Bloom et al., 1993).

Therefore the integrated learning process of the object's features, movements according to
the affordances, and other knowledge is a globally conceptualized process through visual and
tactile perception. This conceptualized learning is a precursor of a pre-symbolic
representation of language development. This learning is the process of forming an abstract
and simplified representation for information exchange and sharing¹. Conceptualization from
visual perception usually includes a planning process: first, the speaker receives and
segments visual knowledge in the perceptual flow into a number of states on the basis of
different criteria; then the speaker selects essential elements, such as the units to be
verbalized; and last, the speaker constructs certain temporal perspectives when the events
have to be anchored and linked (cf. Habel and Tappe, 1999; von Stutterheim and Nuse, 2003).
Assuming this planning process is distributed between ventral and dorsal streams, the
conceptualization process should also emerge from the visual information that is perceived in
each stream, associating the distributed information in both streams. As a result, the
candidate concepts of visual information are statistically associated with the input stimuli.
For instance, they may represent a particular visual feature with a particular class of label
(e.g., a particular visual stimulus with an auditory wording "circle") (Chemla et al., 2009).
Furthermore, the establishment of such links also strengthens the high-order associations
that generate predictions and generalize to novel visual stimuli (Yu, 2008). Once the infants
have learned a sufficient number of words, they begin to detect a particular conceptualized
cue with a specific kind of wording. At this stage, infants begin to use their own
conceptualized visual "database" of known words to identify a novel meaning class and
possibly to extend their wording vocabulary (Smith et al., 2002). Thus, this associative
learning process enables the acquisition and the extension of the concepts of domain-specific
information (e.g., features and movements in our experiments) with the visual stimuli.

¹ For a comparison of conceptualization between engineering and language perspectives, see
(Gruber and Olsen, 1994; Bowerman and Levinson, 2001).

This conceptualization will further result in a pre-symbolic way for infants to communicate
when they encounter a conceptualized object and intend to execute a correspondingly
conceptualized, well-practised sensorimotor action toward that object. For example,
behavioral studies showed that when 8-to-11-month-old infants are unable to reach and pick up
an empty cup, they may point it out to the parents and execute an arm movement intending to
bring it to their lips. The conceptualized shape of a cup reminds infants of its affordance
and thus they can communicate in a pre-symbolic way. Thus, the emergence from the
conceptualized visual stimuli to the pre-symbolic communication also gives further rise to
the different periods of learning nouns and verbs in infancy development (cf. Gentner, 1982;
Tardif, 1996; Bassano, 2000). This evidence supports that the production of verbs and nouns
is not correlated to the same modality in sensory perception: experiments performed by
Kersten (1998) suggest that nouns are more related to the movement orientation caused by the
intrinsic properties of an object, while verbs are more related to the trajectories of an
object. Thus we argue that such differences in the acquisition of lexical classes also relate
to the conceptualized visual ventral and dorsal streams. The finding is consistent with the
hypothesis of Damasio and Tranel (1993) that verb generation is modulated by the perception
of conceptualization of movement and its spatio-temporal relationship.

For this reason, we propose that the conceptualized visual information, which is a
prerequisite for the pre-symbolic communication, is also modulated by perception in two
visual streams. Previous studies have modeled the functional modularity in the development of
ventral and dorsal streams (e.g., Jacobs et al., 1991; Mareschal et al., 1999), proposed
bilinear models of visual routing (e.g., Olshausen et al., 1993; Memisevic and Hinton, 2007;
Bergmann and von der Malsburg, 2011), in which a set of control neurons dynamically modifies
the weights of the "what" pathway on a short time scale, and proposed transform-invariance
models (e.g., Foldiak, 1991; Wiskott and Sejnowski, 2002), which encourage the neurons to
fire invariantly while transformations are performed on their input stimuli. However, a model
that explains the development of conceptualization from both streams and results in an
explicit representation of conceptualization of both streams while the visual stimuli are
presented is still missing in the literature. This conceptualization should be able to encode
the same category for information flows in both ventral and dorsal streams like "object
files" in visual understanding (Fields, 2011) so that they could be discriminated in
different contexts during language development.

On the other hand, this conceptualized representation that is distributed in two visual
streams is also able to predict the tendency of appearance of an action-oriented object in
the visual field, which causes some sensorimotor phenomena such as object permanence
(Tomasello and Farrar, 1986), showing that infants' attention is usually driven by the
object's features and movements. For instance, when infants are observing the movement of the
object, recordings showed an increase in looking times when the visual information after
occlusion is violated in either surface features or location (Mareschal and Johnson, 2003).
Words and sounds also play a top-down role in early infants' visual attention (Sloutsky and
Robinson, 2008). This could hint at the different development stages of the ventral and
dorsal streams and their effect on the conceptualized prediction mechanism in the infant's
consciousness. Accordingly, the model we propose for conceptualized visual information should
also be able to explain the emergence of a predictive function in the sensorimotor system,
e.g., the ventral stream attempts to track the object and the dorsal stream processes and
predicts the object's spatial location, when the sensorimotor system is involved in an object
interaction. This built-in predictive function in a forward sensorimotor system is essential:
neuroimaging research has revealed the existence of internal forward models in the parietal
lobe and the cerebellum that predict sensory consequences from efference copies of motor
commands (Kawato et al., 2003) and support fast motor reactions (e.g., Hollerbach, 1982).
Since the probable position and the movement pattern of the action should be predicted on a
short time scale, sensory feedback produced by a forward model with negligible delay is
necessary in this sensorimotor loop.

In particular, the predictive sensorimotor model we propose is suitable to work as one of the
building modules that takes into account the predictive object movement in a forward
sensorimotor system to deal with object interaction from visual stimuli input, as Figure 1
shows. This system is similar to the sensorimotor integration of Wolpert et al. (1995), but
it includes an additional sensory estimator (the lower brown block) which takes into account
the visual stimuli from the object so that it is able to predict the dynamics of both the
end-effector (which is accomplished by the upper brown block) and the sensory input of the
object. This object-predictive module is essential in a sensorimotor system to generate
sensorimotor actions like tracking and avoiding when dealing with fast-moving objects, e.g.,
in ball sports. We also assert that the additional inclusion of forward models in the visual
perception of the objects can explain some predictive developmental sensorimotor phenomena,
such as object permanence.

In summary, we propose a model that establishes links between the development of
ventral/dorsal visual streams and the emergence of the conceptualization in visual streams,
which further leads to the predictive function of a sensorimotor system. To validate this
proof-of-concept model, we also conducted experiments in a simplified robotics scenario. Two
NAO robots were employed in the experiments: one of them was used as a "presenter" and moved
its arm along pre-programmed trajectories as motion primitives. A ball was attached at the
end of the arm so that the other robot could obtain the movement by tracking the ball. Our
neural network was trained and run on the other NAO, which was called the "observer". In this
way, the observer robot perceived the object movement passively through its vision, so that
its network took the object's visual features and movements into account. Though we could
also have used one robot and a human presenter to run the same tasks, we used two identical
robots for the following reasons: (1) the object movement trajectories can be performed by
pre-programmed machinery so that their types and parameters can be adjusted; (2) the use of
two identical robots makes it easier to interchange the roles of presenter and observer. As
in other humanoid robots, a sensorimotor cycle composed of cameras and motors also exists in
NAO robots. Although the physical configurations and parameters of its sensory and motor
systems differ from those of human beings or other biological systems, our model only handles
the pre-processed information extracted from visual stimuli. Therefore it is sufficient to
serve as a neural model running on a robot CPU to explain the language development in the
cortical areas.

2. MATERIALS AND METHODS 
2.1. NETWORK MODEL 

A similar forward model exhibiting sensory prediction for visual object perception has been
proposed in our recently published work (Zhong et al., 2012b), where we suggested an RNN
implementation of the sensory forward model. Together with a CACLA-trained multi-layer
network as a controller model, the forward model embodied in a robot receiving visual
landmark percepts enabled a smooth and robust robot behavior. However, one drawback of this
work was its inability to store multiple sets of spatial-temporal input-output mappings,
i.e., the learning did not converge if several spatial-temporal mapping sequences appeared in
the training. Consequently, a simple RNN was not able to predict different sensory percepts
for different reward-driven tasks. Another problem was that it assumed only one visual
feature appeared in the robot's visual field, and that was the only visual cue it could learn
during development. To solve the first problem, we further augment the RNN with parametric
bias (PB) units. They are connected like ordinary biases, but the internal values are also
updated through back-propagation. Compared to the generic RNN, the additional PB units in
this network act as bifurcation parameters for the non-linear dynamics. According to Cuijpers
et al. (2009), a trained RNNPB can successfully retrieve and recognize different types of
pre-learned, non-linear oscillation dynamics. Thus, this bifurcation function can be regarded
as an expansion of the storage capability of working memory within the sensory system.
Furthermore, it adds the generalization ability of the PB units, in terms of recognizing and
generating non-linear dynamics. To tackle the second problem, in order to realize
sensorimotor prediction behaviors such as object permanence, the model should be able to
learn objects' features and object movements separately in the ventral and dorsal visual
streams, as we have shown in Zhong et al. (2012a).

Merging these two ideas, in the context of sensorimotor integration in hand-object
interaction, the PB units can be considered as a small set of high-level conceptualized units
that describe various types of non-linear dynamics of visual percepts, such as features and
movements. This representation is more related to, for instance, the "natural prototypes"
from visual perception than to a specific language representation (Rosch, 1973).

The development of PB units can also be seen as the pre-symbolic communication that emerges
during sensorimotor learning. The conceptualization, on the other hand, could also result in
the prediction of future visual percepts of moving objects in sensorimotor integration.

In this model (Figure 2), we propose a three-layer, horizontal product Elman network with PB
units. Similar to the original RNNPB model, the network can be executed in three running
modes, according to the pre-known conditions of inputs and outputs: learning, recognition and
prediction. In learning mode, the representations of object features and movements are first
encoded in the weights of both streams, while the bifurcation parameters with a smaller
number of dimensions are encoded in the PB units. This is consistent with the emergence of
conceptualization at the sensorimotor stage of infant development.

FIGURE 1 | Diagram of sensorimotor integration with the object interaction. The lower forward
model predicts the object movement, while the upper forward model extracts the end-effector
movement from sensory information in order to accomplish a certain task (e.g., object
interaction).

FIGURE 2 | The RNNPB-horizontal network architecture, where k layers represent k different
types of features. Size of M indicates the transitional information of the object.

Apart from the PB units, another novelty in the network is that the visual object information
is encoded in two neural streams and is further conceptualized in PB units. The two streams
share the same set of input neurons, where the coordinates of the object in the visual field
are used as identities of the perceived images. The appearance of values in different layers
represents different visual features: in our experiment, the color of the object detected by
the yellow filter appears in the first layer whereas the color detected by the green filter
appears in the second layer; the other layer remains zero. For instance, the input
((0, 0), (x, y)) represents a green object at (x, y) coordinates in the visual field. The
hidden layer contains two independent sets of units representing dorsal-like "d" and
ventral-like "v" neurons, respectively. These two sets of neurons are inspired by the
functional properties of dorsal and ventral streams: (1) fast responding dorsal-like units
predict object position and hence encode movements; (2) slow responding ventral-like units
represent object features. The recurrent connection in the hidden layers also helps to
predict movements in layer d and to maintain a persistent representation of an object's
feature in layer v. The horizontal product brings both pathways together again in the output
layer with one-step ahead predictions. Let us denote the output layer's input from layer d
and layer v as $x^d$ and $x^v$, respectively. The network output $s^o$ is obtained via the
horizontal product as

$$s^o = x^d \odot x^v \tag{1}$$

where $\odot$ indicates element-wise multiplication, so each pixel is defined by the product
of two independent parts, i.e., for output unit $k$ it is $s^o_k = x^d_k \cdot x^v_k$.
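As a minimal illustration of Equation (1), the following NumPy sketch multiplies the two output-layer inputs element-wise. The function name `horizontal_product` and the numeric values are hypothetical and chosen only to show how a zero in one stream suppresses the corresponding output unit; they are not taken from the experiments.

```python
import numpy as np

def horizontal_product(x_d, x_v):
    """Element-wise (horizontal) product of the dorsal and ventral inputs, Eq. (1)."""
    x_d = np.asarray(x_d, dtype=float)
    x_v = np.asarray(x_v, dtype=float)
    if x_d.shape != x_v.shape:
        raise ValueError("both streams must feed the same number of output units")
    return x_d * x_v

# Hypothetical values: the dorsal part carries position-like values,
# the ventral part switches the first feature layer off and the second on.
x_d = np.array([0.3, -0.2, 0.3, -0.2])
x_v = np.array([0.0, 0.0, 1.0, 1.0])
s_o = horizontal_product(x_d, x_v)
```

Because each output unit is a product of one dorsal and one ventral term, either stream alone can set an output unit to zero, which is what allows features and positions to be represented separately.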



2.2. NEURAL DYNAMICS 

We use $s^b(t)$ to represent the activation of the input units and $PB^{d/v}(t)$ to represent
the activation of the dorsal/ventral PB units at time-step $t$. In some of the following
equations, the time-index $t$ is omitted if all activations are from the same time-step. The
inputs to the hidden units $y^v_j$ in the ventral stream and $y^d_l$ in the dorsal stream are
defined as

$$y^d_l(t) = \sum_i s^b_i(t)\, w^d_{li} + \sum_{l'} s^d_{l'}(t-1)\, v^d_{ll'} + \sum_{n_1} PB^d_{n_1}\, \tilde{w}^d_{l n_1} \tag{2}$$

$$y^v_j(t) = \sum_i s^b_i(t)\, w^v_{ji} + \sum_{j'} s^v_{j'}(t-1)\, v^v_{jj'} + \sum_{n_2} PB^v_{n_2}\, \tilde{w}^v_{j n_2} \tag{3}$$

where $w^d_{li}$, $w^v_{ji}$ represent the weighting matrices between dorsal/ventral layers
and the input layer, $\tilde{w}^d_{l n_1}$, $\tilde{w}^v_{j n_2}$ represent the weighting
matrices between PB units and the two hidden layers, and $v^d_{ll'}$ and $v^v_{jj'}$ indicate
the recurrent weighting matrices within the hidden layers.

The transfer functions in both hidden layers and the PB units all employ the sigmoid function
recommended by LeCun et al. (1998),

$$s^{d/v}_{l/j} = 1.7159 \cdot \tanh\left(\tfrac{2}{3}\, y^{d/v}_{l/j}\right) \tag{4}$$

$$PB^{d/v}_{n_1/n_2} = 1.7159 \cdot \tanh\left(\tfrac{2}{3}\, \rho^{d/v}_{n_1/n_2}\right) \tag{5}$$

where $\rho^{d/v}$ represent the internal values of the PB units.

The terms of the horizontal products of both pathways can be presented as follows:

$$x^v_k = \sum_j s^v_j\, u^v_{kj}, \qquad x^d_k = \sum_l s^d_l\, u^d_{kl} \tag{6}$$

where $u^v_{kj}$ and $u^d_{kl}$ denote the weights between the ventral/dorsal hidden layers
and the output layer. The output of the two streams composes a horizontal product for the
network output as we defined in Equation (1).

2.2.1. Learning mode

The training progress is basically determined by the cost function:

$$C = \frac{1}{2} \sum_{t}^{T} \sum_{k}^{N} \left( s^b_k(t+1) - s^o_k(t) \right)^2 \tag{7}$$

where $s^b_k(t+1)$ is the one-step ahead input (as well as the desired output), $s^o_k(t)$ is
the current output, $T$ is the total number of available time-step samples in a complete
sensorimotor sequence and $N$ is the number of output nodes, which is equal to the number of
input nodes. Following gradient descent, each weight update in the network is proportional to
the negative gradient of the cost with respect to the specific weight $w$ that will be
updated:

$$\Delta w_{ij} = -\eta_{ij} \frac{\partial C}{\partial w_{ij}} \tag{8}$$

where $\eta_{ij}$ is the adaptive learning rate of the weights between neurons $i$ and $j$,
which is adjusted in every epoch (Kleesiek et al., 2013). To determine whether the learning
rate has to be increased or decreased, we compute the changes of the weight $w_{ij}$ in
consecutive epochs:

$$\sigma_{ij} = \frac{\partial C}{\partial w_{ij}}(e-1) \cdot \frac{\partial C}{\partial w_{ij}}(e) \tag{9}$$

The update of the learning rate is

$$\eta_{ij}(e) = \begin{cases} \min\left(\eta_{ij}(e-1) \cdot \xi^{+},\ \eta_{\max}\right) & \text{if } \sigma_{ij} > 0, \\ \max\left(\eta_{ij}(e-1) \cdot \xi^{-},\ \eta_{\min}\right) & \text{if } \sigma_{ij} < 0, \\ \eta_{ij}(e-1) & \text{else,} \end{cases}$$

where $\xi^{+} > 1$ and $\xi^{-} < 1$ represent the increasing/decreasing rate of the adaptive
learning rates, with $\eta_{\min}$ and $\eta_{\max}$ as lower and upper bounds, respectively.
Thus, the learning rate of a particular weight increases by $\xi^{+}$ to speed up the
learning when the changes of that weight from two consecutive epochs have the same sign, and
vice versa.
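The sign-based rule described above can be sketched in a few lines of NumPy. This is only an illustrative sketch: the function name `update_learning_rates` is hypothetical, the default constants are the values listed in Table 1, and the gradients are assumed to have been computed elsewhere (e.g., by back-propagation through time).

```python
import numpy as np

def update_learning_rates(eta, grad_prev, grad_curr,
                          xi_plus=1.000001, xi_minus=0.999999,
                          eta_min=1e-7, eta_max=1e-1):
    """Per-weight adaptive learning rate driven by the sign of Eq. (9).

    eta, grad_prev and grad_curr are arrays of the same shape holding the
    current learning rates and the cost gradients of two consecutive epochs.
    """
    sigma = grad_prev * grad_curr                    # Eq. (9)
    increased = np.minimum(eta * xi_plus, eta_max)   # same sign: speed up
    decreased = np.maximum(eta * xi_minus, eta_min)  # sign flip: slow down
    return np.where(sigma > 0, increased,
                    np.where(sigma < 0, decreased, eta))

def apply_weight_update(w, eta, grad):
    """Gradient-descent step of Eq. (8) with per-weight learning rates."""
    return w - eta * grad
```

Keeping one learning rate per weight lets weights whose gradient keeps the same sign accelerate, while weights whose gradient oscillates are damped, with both cases clipped to the stated bounds.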

Besides the usual weight update according to back-propagation through time, the accumulated
error over the whole time-series also contributes to the update of the PB units. The update
for the i-th unit in the PB vector for a time-series of length $T$ is defined as:

$$\rho_i(e+1) = \rho_i(e) + \gamma_i \sum_{t=1}^{T} \delta^{PB}_{i,t} \tag{10}$$

where $\delta^{PB}$ is the error back-propagated to the PB units, $e$ is the index of the
pass over the whole time-series (i.e., the epoch), and $\gamma_i$ is the PB units' adaptive
updating rate, which is proportional to the absolute mean value of the back-propagated error
at the i-th PB node over the complete time-series of length $T$:

$$\gamma_i \propto \left| \frac{1}{T} \sum_{t=1}^{T} \delta^{PB}_{i,t} \right| \tag{11}$$

The adaptive technique is applied because the PB units were found to converge with
difficulty. Usually a smaller learning rate is used in the generic version of RNNPB to ensure
the convergence of the network. However, this results in a trade-off in convergence speed.
The adaptive learning rate is an efficient technique to overcome this trade-off (Kleesiek et
al., 2013).
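A minimal sketch of the PB update of Equations (10) and (11) is given below. It assumes that the back-propagated errors at the PB units have already been collected into an array of shape `(T, n_pb)`; the function name `pb_learning_update` is hypothetical, and the proportionality in Equation (11) is turned into an equality using the constant listed in Table 1 (named `m_gamma` here).

```python
import numpy as np

def pb_learning_update(rho, delta_pb, m_gamma=1e-2):
    """One epoch of the PB internal-value update in learning mode.

    rho      : array (n_pb,), internal PB values at epoch e
    delta_pb : array (T, n_pb), error back-propagated to the PB units
    m_gamma  : proportionality constant of the adaptive updating rate
    """
    delta_pb = np.asarray(delta_pb, dtype=float)
    gamma = m_gamma * np.abs(delta_pb.mean(axis=0))  # Eq. (11), assumed equality
    return rho + gamma * delta_pb.sum(axis=0)        # Eq. (10)
```

With this choice the step size shrinks automatically as the mean error at a PB unit approaches zero, which is the behavior the adaptive rate is meant to provide.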

2.2.2. Recognition mode 

The recognition mode is executed with a similar information flow as the learning mode: given
a set of spatio-temporal sequences, the error between the target and the real output is
back-propagated through the network to the PB units. However, the synaptic weights remain
constant and only the PB units are updated, so that the PB units self-organize toward the
pre-trained values after a certain number of epochs. Assuming the length of the observed
sequence is $a$, the update rule is defined as:

$$\rho_i(e+1) = \rho_i(e) + \gamma \sum_{t=T-a}^{T} \delta^{PB}_{i,t} \tag{12}$$

where $\delta^{PB}$ is the error back-propagated from a certain sensory information sequence
to the PB units and $\gamma$ is the updating rate of PB units in recognition mode, which
should be larger than the adaptive rate $\gamma_i$ in learning mode.
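The recognition-mode rule of Equation (12) differs from the learning-mode update in two ways: the rate is a fixed constant and only a window of the observed sequence contributes. The sketch below assumes that the window covers the last `a` recorded time-steps and that the back-propagated PB errors are available as an array; the function name `pb_recognition_update` and the default rate are hypothetical.

```python
import numpy as np

def pb_recognition_update(rho, delta_pb, a, gamma=0.1):
    """One recognition-mode update of the PB internal values (weights stay fixed).

    rho      : array (n_pb,), current internal PB values
    delta_pb : array (T, n_pb), error back-propagated to the PB units
    a        : length of the observed window, 1 <= a <= T
    gamma    : constant updating rate (hypothetical default)
    """
    delta_pb = np.asarray(delta_pb, dtype=float)
    if not 1 <= a <= delta_pb.shape[0]:
        raise ValueError("window length a must lie between 1 and T")
    window = delta_pb[-a:]                  # most recent a time-steps
    return rho + gamma * window.sum(axis=0)
```

Setting `a` equal to the full sequence length uses the whole time-series, whereas a smaller `a` corresponds to a sliding window over the most recent observations.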

2.2.3. Prediction mode 

The values of the PB units can also be manually set or obtained 
from recognition, so that the network can generate the upcoming 
sequence with one-step prediction. 
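The closed-loop generation described for prediction mode can be sketched as follows, combining the hidden-layer inputs, transfer functions and horizontal product of Equations (1) to (6). The weights here are random placeholders rather than trained values, one PB unit per stream is assumed, and all names (`init_params`, `forward_step`, `generate`, and the dictionary keys) are hypothetical.

```python
import numpy as np

def lecun_tanh(y):
    """Scaled tanh used for hidden and PB units, Eqs. (4) and (5)."""
    return 1.7159 * np.tanh(2.0 / 3.0 * y)

def init_params(n_in=4, n_d=50, n_v=50, n_pb=1, scale=0.1, seed=0):
    """Random placeholder weights; a trained network would supply these."""
    rng = np.random.default_rng(seed)
    shapes = {
        "W_d": (n_d, n_in), "V_d": (n_d, n_d), "P_d": (n_d, n_pb), "U_d": (n_in, n_d),
        "W_v": (n_v, n_in), "V_v": (n_v, n_v), "P_v": (n_v, n_pb), "U_v": (n_in, n_v),
    }
    return {name: scale * rng.standard_normal(shape) for name, shape in shapes.items()}

def forward_step(s_b, s_d_prev, s_v_prev, pb_d, pb_v, p):
    """One time-step: hidden inputs, transfer functions, horizontal product."""
    y_d = p["W_d"] @ s_b + p["V_d"] @ s_d_prev + p["P_d"] @ pb_d  # dorsal input
    y_v = p["W_v"] @ s_b + p["V_v"] @ s_v_prev + p["P_v"] @ pb_v  # ventral input
    s_d, s_v = lecun_tanh(y_d), lecun_tanh(y_v)
    s_o = (p["U_d"] @ s_d) * (p["U_v"] @ s_v)                     # Eqs. (6) and (1)
    return s_o, s_d, s_v

def generate(s_b0, rho_d, rho_v, p, steps):
    """Closed-loop generation: each one-step prediction becomes the next input."""
    pb_d, pb_v = lecun_tanh(np.asarray(rho_d, float)), lecun_tanh(np.asarray(rho_v, float))
    s_d = np.zeros(p["V_d"].shape[0])
    s_v = np.zeros(p["V_v"].shape[0])
    s_b = np.asarray(s_b0, dtype=float)
    outputs = []
    for _ in range(steps):
        s_b, s_d, s_v = forward_step(s_b, s_d, s_v, pb_d, pb_v, p)
        outputs.append(s_b)
    return np.stack(outputs)

params = init_params()
sequence = generate([0.0, 0.0, 0.2, 0.1], rho_d=[0.5], rho_v=[-0.5], p=params, steps=10)
```

Changing only `rho_d` or `rho_v` while keeping the weights fixed changes the generated sequence, which is the sense in which the PB values act as bifurcation parameters.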

3. RESULTS 

In this experiment, as introduced above, we examined the network by implementing it on two
NAO robots. They were placed face-to-face in a rectangular box of 61.5 cm x 19.2 cm as shown
in Figure 3. These distances were carefully adjusted so that the observer was able to keep
track of movement trajectories in its visual field during all experiments using the images
from the lower camera. The NAO robot has two cameras. We used the lower one to capture the
images because its installation angle is more suitable for tracking the balls when they are
held in the other NAO's hand.

Two balls of 3.8 cm diameter, one yellow and one green, were used for the following
experiments. The presenter consecutively held each of the balls to present the object
interaction. The original image, received from the lower camera of the observer, was
pre-processed with thresholding in HSV color-space, and the coordinates of the centroid of
the ball were calculated from the image moments. Here we only considered two different colors
as the only feature to be encoded in the ventral stream, as well as two sets of movement
trajectories encoded in the dorsal stream. Although we have only tested a few categories of
trajectories and features, we believe the results can be extrapolated to multiple categories
in future applications.
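The pre-processing step can be illustrated with a small NumPy sketch: threshold an HSV image, take the centroid of the resulting mask from its zeroth- and first-order moments, and place the coordinates in the input layer that corresponds to the detected color (yellow in the first layer, green in the second, as described in Section 2.1). The threshold values, the function names and the synthetic test image are all hypothetical; the actual thresholds used on the robot are not given in the text.

```python
import numpy as np

def color_mask(hsv, lower, upper):
    """Boolean mask of pixels whose H, S and V all lie within [lower, upper]."""
    hsv = np.asarray(hsv)
    return np.all((hsv >= np.asarray(lower)) & (hsv <= np.asarray(upper)), axis=-1)

def centroid(mask):
    """Centroid (x, y) from image moments: (m10 / m00, m01 / m00)."""
    m00 = mask.sum()
    if m00 == 0:
        return None                      # object not visible
    rows, cols = np.nonzero(mask)
    return cols.sum() / m00, rows.sum() / m00

def encode_input(xy, color):
    """Network input ((yellow layer), (green layer)); the unused layer stays zero."""
    x, y = xy
    if color == "yellow":
        return np.array([x, y, 0.0, 0.0])
    if color == "green":
        return np.array([0.0, 0.0, x, y])
    raise ValueError("unknown color: " + str(color))

# Synthetic 6 x 8 HSV image with a "green" patch (hypothetical hue range).
hsv = np.zeros((6, 8, 3))
hsv[2:4, 3:6] = (60, 200, 200)
mask = color_mask(hsv, lower=(50, 100, 100), upper=(70, 255, 255))
s_b = encode_input(centroid(mask), "green")
```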

3.1. LEARNING 

The two different trajectories are defined as below.
The cosine curve,

$$x = 12 \tag{13}$$

$$y = 8 \cdot (-t) + 0.04 \tag{14}$$

$$z = 4 \cdot \cos(2t) + 0.10 \tag{15}$$

and the square curve,

[Equations (16)-(18), a piecewise definition of the square curve over sub-intervals of $t$
with boundaries including $-\frac{3\pi}{4}$ and $\frac{3\pi}{4}$, are not recoverable from the
source text.]

where the 3-dimensional tuple (x, y, z) gives the coordinates (in centimeters) of the ball
w.r.t. the torso frame of the NAO presenter. $t$ loops between $(-\pi, \pi]$. In each loop,
we calculated 20 data points to construct trajectories with 4 s sleeping time between every
two data points. Note that although we have defined the optimal desired trajectories, the arm
movement was not ideally identical to the optimal trajectories due to the noisy position
control of the end-effector of the robot. On the observer side, the (x, y) coordinates of the
color-filtered moment of the ball in the visual field were recorded to form a trajectory with
a sampling time of 0.2 s. Five trajectories, in the form of tuples (x, y, z) w.r.t. the torso
frame of the NAO observer, were recorded with each color and each curve, so a total of 20
trajectories were available for training.

In each training epoch, these trajectories, in the form of tuples, were fed into the input
layer one after another for training, with the tuples of the next time-step serving as a
training target. The parameters are listed in Table 1. The final PB values were examined
after the training was done, and the values are shown in Figure 4. It can be seen that the
first PB unit, along with the dorsal stream, was approximately self-organized with the color
information, while the second PB unit, along with the ventral stream, was self-organized with
the movement information.

FIGURE 3 | Experimental scenario: two NAOs are standing face-to-face within a rectangular box.

Zhong et al. A self-organizing pre-symbolic neural model. Frontiers in Behavioral
Neuroscience, www.frontiersin.org, February 2014, Volume 8, Article 22.

Table 1 | Network parameters.

Parameter               Parameter's description                               Value
$\eta_{ventral}$        Learning rate in ventral stream                       1.0 x 10^-5
$\eta_{dorsal}$         Learning rate in dorsal stream                        1.0 x 10^-3
$\eta_{\max}$           Maximum value of learning rate                        1.0 x 10^-1
$\eta_{\min}$           Minimum value of learning rate                        1.0 x 10^-7
$M_{\gamma}$            Proportionality constant of PB units updating rate    1.0 x 10^-2
$n_1$                   Size of PB unit 1                                     1
$n_2$                   Size of PB unit 2                                     1
$n_v$                   Size of ventral-like layer                            50
$n_d$                   Size of dorsal-like layer                             50
$\xi^{-}$               Decreasing rate of learning rate                      0.999999
$\xi^{+}$               Increasing rate of learning rate                      1.000001

3.2. RECOGNITION 

Another four trajectories were presented in the recognition experiment, in which the length
of the sliding-window is equal to the length of the whole time-series, i.e., $T = a$ in
Equation (12). The update of the PB units is shown in Figure 5. Although we used the complete
time-series for the recognition, it should also be possible to use only part of the sequence,
e.g., through the sliding-window approach with a smaller $a$, to fulfil the real-time
requirement in the future.

3.3. PREDICTION 

In this simulation, the PB values obtained from the previous recognition experiment were used
to generate the predicted movements using the prior knowledge of a specific object. Then, the
one-step prediction from the output units was again applied to the input at the next
time-step, so that the whole time-series corresponding to the object's movements and features
was obtained. Figure 6 presents the comparisons between the true values (the same as used in
recognition) and the predicted ones.

From Figure 6 and Table 2, it can be observed that the estimation deviated considerably from
the true value within the first few time-steps, as the RNN needs to accumulate enough input
values to access its short-term memory. However, the error became smaller and the estimation
kept track of the true value in the following time-steps. Considering that the curves are
automatically generated given the PB units and the values at the first time-step, the error
between the true values and the estimated ones is acceptable. Moreover, this result shows
clearly that the conceptualization affects the (predictive) visual perception.
FIGURE 4 | Values of two sets of PB units in the two streams after training. The square
markers represent the PB units after training on the square curves and the triangle markers
represent those after training on the cosine curves. The colors of the markers, yellow and
green, represent the colors of the balls used for training.

FIGURE 5 | Update of the PB values while executing the recognition mode. (A) PB value 1.
(B) PB value 2.




3.4. GENERALIZATION IN RECOGNITION 

To test whether our new computational model has the generalization ability that Cuijpers et
al. (2009) proposed, we recorded another set of sequences of a circle trajectory. The
trajectory is defined as:

$$x = 12 \tag{19}$$

$$y = 4 \cdot \sin(2t) + 0.04 \tag{20}$$

$$z = 4 \cdot \cos(2t) + 0.10 \tag{21}$$

The yellow and green balls were still used. We ran the recognition experiment again with the
previously trained weights. The update of the PB units is shown in Figure 7. Compared to
Figure 4, we can observe that the signs of the PB values are similar to those of the square
trajectory. This is probably because the visual perception of circle and square movements
has more similarities than that of circle and cosine movements.
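For reference, the desired circle trajectory of Equations (19) to (21) can be sampled as in the sketch below. It assumes the same 20 points per loop over the half-open interval $(-\pi, \pi]$ that Section 3.1 states for the training curves; the function name `circle_trajectory` is hypothetical, and the real arm movement would deviate from these ideal points because of the noisy position control noted earlier.

```python
import numpy as np

def circle_trajectory(n_points=20):
    """Desired ball positions (x, y, z) in cm for one loop of the circle curve."""
    # n_points samples in (-pi, pi]: drop the left end point of the closed grid.
    t = np.linspace(-np.pi, np.pi, n_points + 1)[1:]
    x = np.full_like(t, 12.0)              # Eq. (19)
    y = 4.0 * np.sin(2.0 * t) + 0.04       # Eq. (20)
    z = 4.0 * np.cos(2.0 * t) + 0.10       # Eq. (21)
    return np.stack([x, y, z], axis=1)

trajectory = circle_trajectory()
```

Because the argument is $2t$, the ball traverses the circle of radius 4 cm twice while $t$ runs once through its interval.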

3.5. PB REPRESENTATION WITH DIFFERENT SPEEDS 

We further generated 20 trajectories with the same data functions 
(Equations 13-18) but with a slower sampling rate. In other words, 
the balls appeared to move faster from the robot's point of view. 
The final PB values after training are shown in Figure 8. 

It can be seen that the PB values were generally smaller than those 
in Figure 4, probably because less error was propagated during 
training. Moreover, the PB values corresponding to colors (green and 
yellow) and movements (cosine and square) were interchanged within 
the same PB unit (i.e., along the same axis) because the random 
initial parameters of the network differed. Nevertheless, the PB unit 
attached to the dorsal stream still encoded color information, while 
the PB unit attached to the ventral stream encoded movement 
information. The network was thus still able to capture properties 
of the spatio-temporal sequence data in the representation of the 
PB units. 
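
The effect of the sampling interval on apparent speed can be illustrated with a short sketch. The curve below is a hypothetical stand-in rather than Equations 13-18, and the names `sample` and `mean_step_length` as well as the two `dt` values are assumptions made for illustration only.

```python
import math


def sample(path, n_steps, dt):
    """Sample a parametric path at n_steps points spaced dt apart."""
    return [path(i * dt) for i in range(n_steps)]


def mean_step_length(points):
    """Average distance covered between two consecutive observations."""
    dists = [math.dist(a, b) for a, b in zip(points, points[1:])]
    return sum(dists) / len(dists)


def stand_in_path(t):
    # Hypothetical cosine-like curve, NOT the paper's Equations 13-18.
    return (12.0, t, math.cos(t))


slow = sample(stand_in_path, n_steps=20, dt=0.1)  # shorter sampling interval
fast = sample(stand_in_path, n_steps=20, dt=0.2)  # longer interval, same curve
```

With the longer interval the object covers more distance between two consecutive observations, so the same curve appears to be traversed faster.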

4. DISCUSSION 

4.1. NEURAL DYNAMICS 

Table 2 | Prediction error. 

Error of outputs    Unit 1          Unit 2          Unit 3          Unit 4 
Cosine, yellow      2.28 × 10^-4    8.09 × 10^-5    7.29 × 10^-4    8.63 × 10^-4 
Cosine, green       8.34 × 10^-4    7.04 × 10^-4    1.50 × 10^-4    2.01 × 10^-4 
Square, yellow      3.91 × 10^-4    9.64 × 10^-5    1.74 × 10^-3    3.23 × 10^-4 
Square, green       1.40 × 10^-3    3.27 × 10^-4    3.54 × 10^-4    2.60 × 10^-4 

An advantage of the HP-RNN model is that it can learn and encode 
the "what" and "where" information separately in two streams (more 
specifically, in two hidden layers). Both streams are connected 
through horizontal products, which require fewer connections than 
full multiplication (as in the conventional bilinear model) (Zhong 
et al., 2012a). In this paper, we further augmented the HP-RNN with 
the PB units. One set of units, connected to one visual stream, 
reflects the dynamics of sequences in the other stream. This is an 
interesting result since it reveals the neural dynamics in the 
hybrid combination of the RNNPB units and the horizontal product 
model. Taking the dorsal-like hidden layer as an example, the error 
of the attached PB units is 



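$$\delta^{PB,d}_{l} = \sum_{k} \delta^{o}_{k} \, x^{v}_{k} \, f'(\rho^{d}_{l}) \tag{22}$$

$$\delta^{o}_{k} = (s_{k} - \hat{s}_{k}) \, g'(x^{d}_{k} \odot x^{v}_{k}) \tag{23}$$

where $g'(\cdot)$ and $f'(\cdot)$ are the derivatives of the linear and 
sigmoid transfer functions. Since we have the linear output, according to 
the definition of the horizontal product, the equation becomes 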



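$$\delta^{PB,d}_{l} = \sum_{k} \left[ (s_{k} - \hat{s}_{k}) \odot x^{v}_{k} \right] f'(\rho^{d}_{l}) \tag{24}$$

The update of the internal values of the PB units becomes 

$$\rho^{d}_{l}(e+1) = \rho^{d}_{l}(e) + \gamma \sum_{t=T-a}^{T} \delta^{PB,d}_{l}(t) \tag{25}$$

$$= \rho^{d}_{l}(e) + \gamma \sum_{t=T-a}^{T} \sum_{k} \left[ (s_{k}(t) - \hat{s}_{k}(t)) \odot x^{v}_{k}(t) \right] f'(\rho^{d}_{l}(t)) \tag{26}$$

where the $x^{v}_{k}(t)$ term refers to the contribution of the weighted 
summation from the ventral-like layer at time $t$. Note that the 
term $f'(\rho^{d}_{l}(t))$ is actually constant within one epoch, as it is 
only updated after each epoch with a relatively small updating rate. 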
Therefore, from the experimental perspective, given the same 
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FIGURE 7 | Update of the PB values while executing the recognition mode with an untrained feature (circle). (A) PB value 1. (B) PB value 2. 
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FIGURE 8 | Values of two sets of PB units in the two streams after 
training with faster speed. The representation of the markers is the same 
as Figure 4. 

object movement but different object features, the difference of 
the PB values mostly reflects the dynamic changes in the hid- 
den layer of the ventral stream. The same holds for the PB units 
attached to the ventral-like layer. This brief analysis shows that, 
in RNNPB networks with horizontal product connections, the PB units 
of one modality effectively accumulate the non-linear dynamics of 
the other modalities. 
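
The accumulation described above can be sketched in a few lines of Python. This is an illustrative reading of the update, not the authors' implementation. It assumes that the error of a dorsal PB unit is the output error gated by the ventral contribution and scaled by the sigmoid derivative, and that connection weights can be left out for brevity. The names `pb_delta`, `update_rho`, `gamma`, and `window` are hypothetical.

```python
import math


def sigmoid_prime(rho):
    """Derivative of the sigmoid transfer function at internal value rho."""
    s = 1.0 / (1.0 + math.exp(-rho))
    return s * (1.0 - s)


def pb_delta(target, prediction, x_ventral, rho):
    """Error of one dorsal PB unit at one time-step (assumed form).

    The output error (target - prediction) is multiplied component-wise by
    the ventral contribution x_ventral, summed over the output units, and
    scaled by the sigmoid derivative. Weights are omitted (assumption).
    """
    gated = sum((s - s_hat) * xv
                for s, s_hat, xv in zip(target, prediction, x_ventral))
    return gated * sigmoid_prime(rho)


def update_rho(rho, targets, predictions, x_ventrals, window, gamma=0.1):
    """Add the error accumulated over the last window + 1 time-steps.

    rho is held fixed while the sum is computed, mirroring the remark that
    the sigmoid-derivative term is constant within one epoch.
    """
    recent = list(zip(targets, predictions, x_ventrals))[-(window + 1):]
    total = sum(pb_delta(s, s_hat, xv, rho) for s, s_hat, xv in recent)
    return rho + gamma * total


# Same output error, two different ventral contributions.
rho_a = update_rho(0.0, [[1.0, 1.0]], [[0.5, 0.5]], [[1.0, 1.0]], window=0)
rho_b = update_rho(0.0, [[1.0, 1.0]], [[0.5, 0.5]], [[2.0, 2.0]], window=0)
```

In this toy setting the two updates differ only because the ventral contribution differs, which is the sense in which the dorsal PB unit reflects the dynamics of the ventral stream.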

4.2. CONCEPTUALIZATION IN VISUAL PERCEPTION 

The visual conceptualization and perception are intertwined pro- 
cesses. As experiments from Schyns and Oliva (1999) show, when 
the visual observation is not clear, the brain automatically extrap- 
olates the visual percept and updates the categorization labels 
on various levels according to what has been gained from the 
visual field. On the other hand, this conceptualization also affects 
the immediate visual perception in a top-down predictive man- 
ner. For instance, the identity conceptualization of a human face 
predictively spreads conceptualizations in other levels (e.g., face 
emotion). This top-down process propagates from object iden- 
tity to other local conceptualizations, such as object affordance, 
motion, edge detection and other processes at the early stages of 
visual processing. This can be tested by classic illusions, such as 
"the goblet illusion," where perception depends largely on top- 
down knowledge derived from past experiences rather than direct 
observation. This kind of illusion may be explained by the error 
in the first few time steps of the prediction experiment of our 
model. Therefore, our model to some extent also demonstrates 
the integrated process between the conceptualization and the 
spatio-temporal visual perception. This top-down predictive 
perception may also give rise to other vision-based predictive 
behaviors such as object permanence. 

In particular, the PB units act as a high-level conceptualization 
representation, which is continuously updated with the partial 
sensory information perceived on a short time scale. The prediction 
process of the RNNPB is assisted by the conceptualized PB units of 
visual perception, which corresponds to the integration of 
conceptualization and (predictive) visual perception. This is the 
reason why the PB units were not processed as a binary 
representation, as Ogata et al. (2007) did for human-robot 
interaction; the original values of the PB units are more accurate 
in generating the prediction of the next time-step and in performing 
generalization tasks. As we mentioned, this model is merely a 
proof-of-concept model that bridges conceptualized visual streams 
and sensorimotor prediction. For more complex tasks, besides 
expanding the network size as mentioned, more complex networks that 
are capable of extracting and predicting higher-level 
spatio-temporal structures (e.g., predictive recurrent networks with 
a large learning capacity by Tani and colleagues: Yamashita and 
Tani, 2008; Murata et al., 2013) can also be applied. It would be 
interesting to further investigate the functional modularity 
representation of these network models when they, too, are 
interconnected with the horizontal product. 

Furthermore, the neuroscience basis that supports this paper, in the 
context of the mirror neuron system based on object-oriented actions 
(grasping), can be found in "data-driven" models such as MNS (Oztop 
and Arbib, 2002) and MNS2 (Bonaiuto et al., 2007; Bonaiuto and 
Arbib, 2010), although the main hypothesis in our model is not taken 
from the mirror neuron system theory. In the MNS review paper by 
Oztop et al. (2006), the action generation mode of the RNNPB model 
was considered to be excessive, as there is no evidence yet to show 
that the mirror neuron system participates in action generation. 
However, in our model the generation mode plays a key role, through 
the conceptualized PB units, in the sensorimotor integration of 
object interaction. Nevertheless, the similar network architecture 
(RNNPB) used in modeling mirror neurons (Tani et al., 2004) and in 
our pre-symbolic sensorimotor integration model may imply a close 
relationship between language (pre-symbolic) development, 
object-oriented actions, and the mirror neuron theory. 

5. CONCLUSION 

In this paper, a recurrent network architecture integrating the 
RNNPB model and the horizontal product model has been presented, 
which sheds light on the feasibility of linking the 
conceptualization of ventral/dorsal visual streams, the emergence 
of pre-symbolic communication, and the predictive sensorimotor 
system. 

Based on the horizontal product model, the information in the 
dorsal and ventral streams is separately encoded in two network 
streams, and the predictions of both streams are brought together 
via the horizontal product, while the PB units act as a 
conceptualization of both streams. These PB units allow for storing 
multiple sensory sequences. After training, the network is able to 
recognize the pre-learned conceptualized information and to predict 
the upcoming visual perception. The network also shows robustness 
and generalization abilities. Therefore, our approach offers 
preliminary concepts for a similar development of conceptualized 
language in pre-symbolic communication and, further, in infants' 
sensorimotor-stage learning. 
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